2026-06-30

Diffusion Model

Diffusion models are a class of generative models that gradually corrupt data with noise (forward diffusion) and learn the reverse process, which recovers data from noise.

1. Core Idea

Diffusion models consist of two core processes:

Process	Direction	Description
Forward Diffusion	Data $\to$ Noise	Gradually add Gaussian noise to data until it becomes pure noise
Reverse Denoising	Noise $\to$ Data	Learn to gradually recover original data from noise

[!NOTE] Physical Analogy
Similar to diffusion in thermodynamics: a drop of ink in water spreads until it is uniformly distributed. The reverse process "condenses" the uniform distribution back to the initial state.

2. Forward Diffusion Process

2.1 Discrete-Time Formulation (DDPM)

Given a data point $x_{0} \sim q (x)$ , Gaussian noise is added step by step:

q (x_{t} ∣ x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)

where $β_{t} \in (0, 1)$ is the noise schedule, typically satisfying $β_{1} < β_{2} < \dots < β_{T}$ .

After $T$ steps, $x_{T}$ is approximately distributed as a standard Gaussian $N (0, I)$ , provided the schedule injects enough total noise.

Common Noise Schedules:

Schedule	Formula	Characteristics
Linear	$β_{t} = β_{1} + (t - 1) \frac{β_{T} - β_{1}}{T - 1}$	Simple, widely used
Cosine	${\bar{α}}_{t} = \frac{f (t)}{f (0)}$ , $f (t) = \cos {(\frac{t / T + s}{1 + s} \cdot \frac{π}{2})}^{2}$	Adds noise more gradually than linear; the small offset $s$ keeps $β_{t}$ from vanishing near $t = 0$
Quadratic	$β_{t} = {(\sqrt{β_{1}} + (t - 1) \frac{\sqrt{β_{T}} - \sqrt{β_{1}}}{T - 1})}^{2}$	Slower initial noise
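
The linear and cosine schedules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the endpoint values `beta_1 = 1e-4`, `beta_T = 0.02`, the offset `s = 0.008`, the clipping value `max_beta = 0.999`, and all function names are assumptions that do not appear in this note (they follow commonly used defaults).

```python
import math

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linear schedule: beta_t interpolates from beta_1 to beta_T for t = 1..T."""
    return [beta_1 + (t - 1) * (beta_T - beta_1) / (T - 1) for t in range(1, T + 1)]

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar_t = f(t) / f(0), returned for t = 0..T."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped (assumed value)."""
    ab = cosine_alpha_bar(T, s)
    return [min(1 - ab[t] / ab[t - 1], max_beta) for t in range(1, T + 1)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s), returned for t = 1..T."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

T = 1000
lin_ab = cumulative_alpha_bar(linear_betas(T))   # alpha_bar under the linear schedule
cos_ab = cosine_alpha_bar(T)[1:]                 # alpha_bar under the cosine schedule
```

Comparing `lin_ab` and `cos_ab` at the midpoint (`t = T / 2`) shows that the cosine schedule retains noticeably more signal there, which is the sense in which it adds noise more gradually.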

2.2 Reparameterization Trick

Define $α_{t} = 1 - β_{t}$ , ${\bar{α}}_{t} = \prod_{s = 1}^{t} α_{s}$ , then:

x_{t} = \sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ, ϵ \sim N (0, I)

Key Property: We can sample $x_{t}$ at any timestep directly without iterating:

q (x_{t} ∣ x_{0}) = N (x_{t}; \sqrt{{\bar{α}}_{t}} x_{0}, (1 - {\bar{α}}_{t}) I)
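
As a quick sanity check, the sketch below (plain Python, scalar data) compares the closed-form moments of $q (x_{t} ∣ x_{0})$ with the moments obtained by composing $t$ single-step kernels. The linear schedule values and the function names are assumptions made for illustration only.

```python
import math
import random

def alpha_bars(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s) for t = 1..T."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 1-indexed); also return the noise."""
    ab = alpha_bar[t - 1]
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps, eps

def iterate_moments(x0, t, betas):
    """Mean and variance of x_t from applying the single-step kernel t times."""
    mean, var = x0, 0.0
    for b in betas[:t]:
        mean = math.sqrt(1.0 - b) * mean   # mean shrinks by sqrt(1 - beta) each step
        var = (1.0 - b) * var + b          # variance shrinks, then beta is added
    return mean, var

betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]  # assumed schedule
alpha_bar = alpha_bars(betas)
t, x0 = 300, 1.5
mean_iter, var_iter = iterate_moments(x0, t, betas)
mean_closed = math.sqrt(alpha_bar[t - 1]) * x0
var_closed = 1.0 - alpha_bar[t - 1]
x_t, eps = q_sample(x0, t, alpha_bar, random.Random(0))
```

Both routes give the same mean and variance, which is why training can jump straight to an arbitrary timestep.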

2.3 Continuous-Time Formulation ([[Stochastic Differential Equation (SDE)|SDE]])

The forward process can be written as a [[Stochastic Differential Equation (SDE)|Stochastic Differential Equation]]:

d x = f (t) x d t + g (t) d W_{t}

where $W_{t}$ is a [[Wiener Process|Wiener Process]], $f (t)$ is the drift coefficient, and $g (t)$ is the diffusion coefficient.

Discrete-Continuous Correspondence (for the variance-preserving SDE, $f (t) = - \frac{1}{2} β (t)$ and $g (t) = \sqrt{β (t)}$ ):

{\bar{α}}_{t} = \exp (- \int_{0}^{t} β (s) d s)
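
A short derivation may help here. It assumes the variance-preserving choice $f (t) = - \frac{1}{2} β (t)$, $g (t) = \sqrt{β (t)}$, and matches the conditional mean of the SDE with the discrete mean coefficient $\sqrt{{\bar{α}}_{t}}$ from Section 2.2:

```latex
\begin{aligned}
dx &= -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t \\
\frac{d}{dt}\,\mathbb{E}[x_t \mid x_0] &= -\tfrac{1}{2}\beta(t)\,\mathbb{E}[x_t \mid x_0]
  \;\Rightarrow\;
  \mathbb{E}[x_t \mid x_0] = \exp\!\Big(-\tfrac{1}{2}\int_0^t \beta(s)\,ds\Big)\,x_0 \\
\sqrt{\bar{\alpha}_t} &= \exp\!\Big(-\tfrac{1}{2}\int_0^t \beta(s)\,ds\Big)
  \;\Rightarrow\;
  \bar{\alpha}_t = \exp\!\Big(-\int_0^t \beta(s)\,ds\Big)
\end{aligned}
```

The discrete product agrees with this in the small-step limit, since $\prod_{s} (1 - β_{s}) \approx \exp (- \sum_{s} β_{s})$ when every $β_{s}$ is small.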

3. Reverse Denoising Process

3.1 Discrete-Time Formulation

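The reverse process is also modeled as a Gaussian distribution:

p_{θ} (x_{t - 1} ∣ x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t))

A neural network predicts the mean $μ_{θ}$ and the variance $Σ_{θ}$ (DDPM fixes the variance to a constant per timestep, $Σ_{θ} = σ_{t}^{2} I$ , and learns only the mean).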

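Forward Posterior (Gaussian in closed form once conditioned on $x_{0}$ ; the unconditioned reverse $q (x_{t - 1} ∣ x_{t})$ is approximately Gaussian only when $β_{t} \to 0$ ):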

q (x_{t - 1} ∣ x_{t}, x_{0}) = N (x_{t - 1}; \frac{\sqrt{{\bar{α}}_{t - 1}} β_{t}}{1 - {\bar{α}}_{t}} x_{0} + \frac{\sqrt{α_{t}} (1 - {\bar{α}}_{t - 1})}{1 - {\bar{α}}_{t}} x_{t}, \frac{1 - {\bar{α}}_{t - 1}}{1 - {\bar{α}}_{t}} β_{t} I)
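
The posterior coefficients can be checked numerically. The sketch below (plain Python, scalar data; the schedule values and function name are illustrative assumptions) computes them and confirms that the posterior mean equals the noise-based form $\frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ϵ)$ when $x_{t}$ is built from $x_{0}$ and $ϵ$:

```python
import math

def posterior_params(t, betas, alpha_bar):
    """Coefficients of q(x_{t-1} | x_t, x_0) for 1-indexed t >= 2.

    Returns (coef_x0, coef_xt, variance) with mean = coef_x0 * x_0 + coef_xt * x_t.
    """
    beta_t = betas[t - 1]
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    variance = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return coef_x0, coef_xt, variance

betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]  # assumed schedule
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

t, x0, eps = 500, 0.7, -1.2
ab_t = alpha_bar[t - 1]
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps

c0, ct, var = posterior_params(t, betas, alpha_bar)
mean_from_x0 = c0 * x0 + ct * x_t
# Same mean, written in terms of the noise instead of x_0
mean_from_eps = (x_t - betas[t - 1] / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(1.0 - betas[t - 1])
```

The posterior variance is always smaller than $β_{t}$, because conditioning on $x_{0}$ removes uncertainty.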

3.2 Simplified Training Objective (DDPM)

Ho et al. (2020) proposed a simplified loss function:

L_{simple} = E_{t, x_{0}, ϵ} [∥ ϵ - ϵ_{θ} (x_{t}, t) ∥^{2}]

$ϵ \sim N (0, I)$ : true noise
$ϵ_{θ} (x_{t}, t)$ : noise predicted by neural network

Full Variational Lower Bound:

L_{VLB} = E_{q} [D_{KL} (q (x_{T} | x_{0}) ∥ p (x_{T})) + \sum_{t = 2}^{T} D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t})) - \log p_{θ} (x_{0} | x_{1})]

3.3 Continuous-Time Formulation (Score-based)

Reverse [[Stochastic Differential Equation (SDE)|SDE]]:

d x = [f (t) x - g (t)^{2} \nabla_{x} \log p_{t} (x)] d t + g (t) d {\bar{W}}_{t}

where ${\bar{W}}_{t}$ is a [[Wiener Process|Wiener Process]] running backward in time, and $\nabla_{x} \log p_{t} (x)$ is the [[Score Function|Score Function]].

[[Score Function]] Estimation:

The [[Score Function]] is learned via score matching:

L (θ) = \frac{1}{2} E_{t} E_{x_{0}} E_{x_{t} | x_{0}} [∥ s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log q (x_{t} | x_{0}) ∥^{2}]

where $\nabla_{x_{t}} \log q (x_{t} | x_{0}) = - \frac{x_{t} - \sqrt{{\bar{α}}_{t}} x_{0}}{1 - {\bar{α}}_{t}}$ .
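
Substituting the reparameterization from Section 2.2 makes the link between the score target and the noise explicit. The last line is a common parameterization choice rather than something stated in this note:

```latex
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\;\Rightarrow\;
\nabla_{x_t}\log q(x_t \mid x_0)
  = -\frac{x_t-\sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t}
  = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}} \\
s_\theta(x_t,t) &= -\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}}
\end{aligned}
```

Under this parameterization, denoising score matching and the noise-prediction loss $L_{simple}$ differ only by a per-timestep weight.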

4. Core Formula Summary

[!QUOTE] DDPM Forward Noising
$x_{t} = \sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ, ϵ \sim N (0, I)$

[!QUOTE] DDPM Simplified Loss
$L_{simple} = E_{t, x_{0}, ϵ} [∥ ϵ - ϵ_{θ} (x_{t}, t) ∥^{2}]$

[!QUOTE] Forward [[Stochastic Differential Equation (SDE)|SDE]]
$d x = f (t) x d t + g (t) d W_{t}$

[!QUOTE] Reverse [[Stochastic Differential Equation (SDE)|SDE]]
$d x = [f (t) x - g (t)^{2} \nabla_{x} \log p_{t} (x)] d t + g (t) d {\bar{W}}_{t}$

[!QUOTE] [[Probability Flow ODE]]
$d x = [f (t) x - \frac{1}{2} g (t)^{2} \nabla_{x} \log p_{t} (x)] d t$

5. Main Variants

Model	Features	Key Contributions
DDPM	Discrete-time, pixel space	Established the basic framework of diffusion models
DDIM	Deterministic sampling, accelerated generation	Non-Markovian forward process, supports skip-step sampling
Score [[Stochastic Differential Equation (SDE)|SDE]]	Continuous-time [[Stochastic Differential Equation (SDE)|SDE]] framework	Unified DDPM and Score Matching
LDM	Latent space diffusion	Perform diffusion in VAE latent space, reducing computation
DiT	Transformer architecture	Use Transformer instead of U-Net
EDM	Systematic study of the design space	Network preconditioning, noise-level parameterization, and a second-order (Heun) sampler
Stable Diffusion	Text-conditional LDM	Cross-attention for text guidance, widely adopted

5.1 DDIM (Denoising Diffusion Implicit Models)

DDIM generalizes DDPM to non-Markovian processes:

x_{t - 1} = \sqrt{{\bar{α}}_{t - 1}} (\frac{x_{t} - \sqrt{1 - {\bar{α}}_{t}} ϵ_{θ} (x_{t}, t)}{\sqrt{{\bar{α}}_{t}}}) + \sqrt{1 - {\bar{α}}_{t - 1} - σ_{t}^{2}} \cdot ϵ_{θ} (x_{t}, t) + σ_{t} z

where $σ_{t} = η \sqrt{\frac{1 - {\bar{α}}_{t - 1}}{1 - {\bar{α}}_{t}}} \sqrt{1 - \frac{{\bar{α}}_{t}}{{\bar{α}}_{t - 1}}}$ , and $η \in [0, 1]$ :

$η = 1$ : DDPM (stochastic)
$η = 0$ : DDIM (deterministic)

Key Advantage: Can use fewer timesteps (e.g., 50 instead of 1000) for faster sampling.
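
A single DDIM update can be sketched as follows (plain Python on scalars; the function and argument names are illustrative assumptions). The usage part checks the property that makes skip-step sampling possible: with the exact noise and $η = 0$, one step lands on the correct marginal at an arbitrary earlier timestep.

```python
import math

def ddim_step(x_t, eps_hat, ab_t, ab_prev, eta=0.0, z=0.0):
    """One DDIM update from x_t to an earlier timestep.

    ab_t, ab_prev: alpha_bar at the current and the earlier timestep.
    eta = 0 is the deterministic sampler; z is a standard normal draw (used if eta > 0).
    """
    sigma = eta * math.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * math.sqrt(1.0 - ab_t / ab_prev)
    x0_pred = (x_t - math.sqrt(1.0 - ab_t) * eps_hat) / math.sqrt(ab_t)  # predicted x_0
    direction = math.sqrt(1.0 - ab_prev - sigma ** 2) * eps_hat          # points back to x_t
    return math.sqrt(ab_prev) * x0_pred + direction + sigma * z

# Usage: jump from a noisy level (alpha_bar = 0.3) to a cleaner one (alpha_bar = 0.8)
x0, eps = 0.5, -0.4
ab_t, ab_prev = 0.3, 0.8
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)                      # deterministic step
expected = math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * eps
x_prev_stochastic = ddim_step(x_t, eps, ab_t, ab_prev, eta=1.0, z=0.3)
```

In practice `eps_hat` comes from the trained network $ϵ_{θ} (x_{t}, t)$ and the timesteps are a sparse subsequence of the training schedule.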

5.2 LDM (Latent Diffusion Models)

Instead of diffusing in pixel space, LDM operates in latent space:

Compress: $z = E (x)$ using VAE encoder
Diffuse: Apply diffusion process to $z$
Decode: $\hat{x} = D (z_{0})$ using VAE decoder

Benefits:

Lower dimensionality (e.g., $64 \times 64 \times 4$ vs $512 \times 512 \times 3$ )
Faster training and inference
Perceptual compression preserves semantic information
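
The dimensionality figures above work out as follows (a small arithmetic sketch; the helper name is hypothetical):

```python
def num_elements(shape):
    """Total number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel_shape = (512, 512, 3)    # RGB image
latent_shape = (64, 64, 4)     # latent from the example above

downsample_factor = pixel_shape[0] // latent_shape[0]                     # per spatial side
compression = num_elements(pixel_shape) / num_elements(latent_shape)     # overall ratio
```

Each spatial side shrinks by a factor of 8, and the denoiser processes 48 times fewer values per sample than it would in pixel space.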

5.3 DiT (Diffusion Transformers)

Replace U-Net with Transformer architecture:

Patching: Split image into patches (like ViT)
Self-attention: Capture long-range dependencies
Scaling: Better performance with larger models
Flexibility: Easy to incorporate conditioning

Result: In Peebles & Xie (2023), DiT-XL/2 outperformed prior U-Net-based diffusion models on class-conditional ImageNet generation, as measured by FID.

6. Training and Sampling Algorithms

6.1 Training Loop

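# Pseudocode (PyTorch-style; t is 1-indexed, alpha_bar[t] = prod_{s<=t} (1 - beta_s))
while not converged:
    x_0 = sample_from_dataset()
    t = sample_uniform_int(1, T)          # t ~ Uniform{1, ..., T}
    epsilon = sample_normal(0, I)         # same shape as x_0

    # Forward noising in closed form (Section 2.2)
    x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * epsilon
    loss = MSE(epsilon, epsilon_theta(x_t, t))

    optimizer.zero_grad()                 # clear gradients from the previous step
    loss.backward()
    optimizer.step()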
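
To see the loop run end to end without a deep-learning framework, here is a deliberately tiny sketch in plain Python. Everything specific in it is an assumption made for illustration: the data are scalars drawn from $N (0, 0.1^{2})$, the schedule is a coarse 10-step one, and the "network" is a single learnable weight per timestep, $ϵ_{θ} (x_{t}, t) = w_{t} x_{t}$, trained with hand-written SGD.

```python
import math
import random

rng = random.Random(0)
T = 10
betas = [0.05 * (t + 1) for t in range(T)]   # coarse toy schedule: 0.05 .. 0.5
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

DATA_STD = 0.1                               # toy dataset: x_0 ~ N(0, 0.1^2)
weights = [0.0] * T                          # eps_theta(x_t, t) = weights[t] * x_t

def noised_sample(t):
    """Sample x_0 and eps, return (x_t, eps) using the closed-form forward process."""
    x0 = rng.gauss(0.0, DATA_STD)
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def average_loss(n=2000):
    """Monte Carlo estimate of L_simple under the current weights."""
    total = 0.0
    for _ in range(n):
        t = rng.randrange(T)
        x_t, eps = noised_sample(t)
        total += (eps - weights[t] * x_t) ** 2
    return total / n

loss_before = average_loss()
lr = 0.1
for step in range(5000):
    t = rng.randrange(T)                     # uniform timestep (0-indexed here)
    x_t, eps = noised_sample(t)
    residual = weights[t] * x_t - eps        # prediction error
    grad = 2.0 * residual * x_t              # d/dw of (w * x_t - eps)^2
    weights[t] -= lr * grad                  # plain SGD step
loss_after = average_loss()
```

The loss starts near 1 (predicting zero noise) and drops substantially after training, because for such low-variance data $x_{t}$ is dominated by the noise term and a linear predictor can recover most of $ϵ$.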

Training Tricks:

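Timestep weighting: Use uniform weighting (as in $L_{simple}$ ) or reweight the loss per timestep, e.g., by a function of the signal-to-noise ratio ${\bar{α}}_{t} / (1 - {\bar{α}}_{t})$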
Architecture: U-Net with attention, group normalization, SiLU activation
EMA: Exponential moving average of model weights for better sampling
Dropout: Apply to attention layers for regularization

6.2 Sampling Loop (DDPM)

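# Pseudocode (t is 1-indexed; alpha[t] = 1 - beta[t])
x = sample_normal(0, I)                               # x_T
for t in reversed(range(1, T + 1)):                   # t = T, ..., 1
    z = sample_normal(0, I) if t > 1 else 0           # no noise on the final step
    epsilon = epsilon_theta(x, t)
    mean = (x - (1 - alpha[t]) / sqrt(1 - alpha_bar[t]) * epsilon) / sqrt(alpha[t])
    x = mean + sqrt(beta[t]) * z                      # sigma_t^2 = beta_t; x is now x_{t-1}
# after the loop, x holds the generated sample x_0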
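
The sampler can be exercised without a trained network by using data for which the ideal noise predictor is known in closed form. In the sketch below (plain Python; all names and values are assumptions for illustration), the data are scalars from $N (2, 0.5^{2})$, and `oracle_eps` returns the exact conditional expectation $E [ϵ ∣ x_{t}]$, standing in for $ϵ_{θ}$.

```python
import math
import random

T = 100
betas = [1e-4 + t * (0.1 - 1e-4) / (T - 1) for t in range(T)]   # toy linear schedule
alphas = [1.0 - b for b in betas]
alpha_bar, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bar.append(prod)

DATA_MEAN, DATA_STD = 2.0, 0.5     # toy data distribution: x_0 ~ N(2, 0.5^2)

def oracle_eps(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; a stand-in for a trained eps_theta."""
    ab = alpha_bar[t]
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)          # marginal variance of x_t
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * DATA_MEAN) / var_t

def ddpm_sample(rng):
    """Ancestral sampling with sigma_t^2 = beta_t (timesteps are 0-indexed here)."""
    x = rng.gauss(0.0, 1.0)                          # x_T ~ N(0, 1)
    for t in reversed(range(T)):
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0    # no noise on the final step
        eps_hat = oracle_eps(x, t)
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alphas[t])
        x = mean + math.sqrt(betas[t]) * z
    return x

rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / (len(samples) - 1))
```

Starting from pure noise centered at 0, the samples end up centered near the data mean of 2, which is the behavior the reverse process is meant to produce.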

6.3 Advanced Sampling Methods

Method	Steps	Approach
DDPM	1000	Original stochastic sampling
DDIM	50-100	Deterministic, skip steps
[[DPM-Solver]]	10-20	ODE solver with adaptive steps
[[DPM-Solver]]++	10-15	Improved stability
UniPC	5-10	Unified predictor-corrector
Consistency Models	1-5	Direct mapping, distillation

Predictor-Corrector Framework (for [[Stochastic Differential Equation (SDE)|SDE]]-based models):

Predictor: Take one step using reverse [[Stochastic Differential Equation (SDE)|SDE]]/ODE
Corrector: Apply Langevin dynamics to refine sample
Repeat: Alternate for better quality

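# Predictor-Corrector Sampling (pseudocode; dt > 0 is the step size, time runs from T down to 0)
for t in reversed(timesteps):
    # Predictor step (Euler-Maruyama on the reverse SDE)
    score = score_model(x_t, t)
    z = sample_normal(0, I)
    x_t = x_t - (drift(x_t, t) - diffusion(t)**2 * score) * dt + diffusion(t) * sqrt(dt) * z

    # Corrector step (Langevin dynamics at fixed t)
    for _ in range(corrector_steps):
        score = score_model(x_t, t)
        z = sample_normal(0, I)               # fresh noise for every step
        x_t = x_t + step_size * score + sqrt(2 * step_size) * z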

7. Advantages and Disadvantages

Advantages

High generation quality: Reaches or exceeds GAN level
Stable training: A simple regression loss with no adversarial min-max game, unlike GANs
Elegant theory: Grounded in non-equilibrium thermodynamics and [[Stochastic Differential Equation (SDE)|SDE]] theory
Flexible conditioning: Easy to incorporate text, image, or other conditions
Coverage: Better mode coverage than GANs (less mode collapse)
Likelihood estimation: Can compute likelihoods via the [[Probability Flow ODE]] (exact up to ODE-solver and trace-estimator error)

Disadvantages

Slow sampling speed: Requires tens to thousands of sequential network evaluations (1000 for the original DDPM sampler)
Sensitive to hyperparameters: Noise schedule affects generation quality
High computational cost: Training requires significant resources
Blurriness: May produce blurry samples compared to GANs (in pixel space)

Acceleration Methods

Algorithm-Level:

DDIM (Deterministic sampling)
[[DPM-Solver]] (Ordinary differential equation solver)
Progressive Distillation
Consistency Models (One-step or few-step generation)

Architecture-Level:

Latent space diffusion (LDM)
Distilled models (smaller, faster)
Quantization and pruning

Hardware-Level:

GPU optimization
Parallel sampling
Mixed precision training

8. Applications in AI Image Generation

Application	Representative Models	Features
Text-to-Image	DALL-E 2, Stable Diffusion, Imagen	Diffusion model + text encoder (CLIP or T5), in latent or cascaded pixel space
Image Editing	InstructPix2Pix, Prompt-to-Prompt	Conditional guided editing
Video Generation	Stable Video Diffusion	Introduce temporal dimension
3D Generation	DreamFusion, Magic3D	Score Distillation Sampling (SDS)
Image Super-Resolution	SR3	Diffusion conditioned on the low-resolution image
Inpainting	Stable Diffusion, GLIDE, RePaint	Mask-guided generation
Style Transfer	StyleDrop, Custom Diffusion	Style adaptation
Controlled Generation	ControlNet, T2I-Adapter	Spatial control signals

8.1 Text-to-Image Generation

Architecture:

Text Encoder: CLIP, T5, or custom transformer
Conditioning: Cross-attention in U-Net/DiT
Diffusion: Latent space denoising
Decoder: VAE decoder to pixel space

Training Data: LAION-5B, COCO, internal datasets

8.2 Image-to-Image Translation

Given source image $x_{src}$ and target description:

x_{result} = Sample (x_{T} \to x_{0} ∣ x_{src}, text)

Methods:

Img2Img: Add noise to source, then denoise with conditioning
ControlNet: Copy and adapt U-Net weights for control
IP-Adapter: Image prompt adapter for visual conditioning

9. Conditional Diffusion Models

9.1 Classifier Guidance

{\tilde{ϵ}}_{θ} (x_{t}, t, c) = ϵ_{θ} (x_{t}, t) - \sqrt{1 - {\bar{α}}_{t}} \cdot \nabla_{x_{t}} \log p_{ϕ} (c ∣ x_{t})
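
The formula follows from Bayes' rule combined with the score-noise relation. A brief derivation, assuming the noise-prediction network approximates the scaled score of the unconditional marginal:

```latex
\begin{aligned}
\nabla_{x_t}\log p(x_t \mid c)
  &= \nabla_{x_t}\log p(x_t) + \nabla_{x_t}\log p(c \mid x_t)
  && \text{(Bayes' rule; } p(c) \text{ does not depend on } x_t\text{)} \\
\nabla_{x_t}\log p(x_t)
  &\approx -\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}} \\
\tilde{\epsilon}_\theta(x_t,t,c)
  &= -\sqrt{1-\bar{\alpha}_t}\,\nabla_{x_t}\log p(x_t \mid c)
  \approx \epsilon_\theta(x_t,t) - \sqrt{1-\bar{\alpha}_t}\,\nabla_{x_t}\log p_\phi(c \mid x_t)
\end{aligned}
```

In practice the classifier gradient is usually multiplied by a scale factor greater than 1 to strengthen the guidance.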

Pros:

Guides a pre-trained unconditional diffusion model without retraining it
Flexible guidance strength

Cons:

Requires training a separate classifier on noisy inputs $x_{t}$
Limited to conditions a classifier can predict (e.g., class labels)

9.2 Classifier-Free Guidance

{\tilde{ϵ}}_{θ} (x_{t}, t, c) = (1 + w) \cdot ϵ_{θ} (x_{t}, t, c) - w \cdot ϵ_{θ} (x_{t}, t)

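where $w > 0$ is the guidance strength and $ϵ_{θ} (x_{t}, t)$ is the unconditional prediction (the condition replaced by a null token $\emptyset$ ). Setting $w = 0$ recovers the plain conditional model.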

Training: Randomly drop condition (e.g., 10% probability) during training to learn unconditional model.
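
Both halves of the method are short enough to sketch in plain Python. The function names, the `None` null token, and the example values are hypothetical; a real model would use a learned null embedding and tensors rather than lists.

```python
import random

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond, element-wise."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

NULL_CONDITION = None   # hypothetical stand-in for the learned "no condition" token

def maybe_drop_condition(condition, rng, p_uncond=0.1):
    """Training-time condition dropout: swap in the null token with probability p_uncond."""
    return NULL_CONDITION if rng.random() < p_uncond else condition

# Sampling: combine the two predictions made at the same (x_t, t)
eps_c = [0.2, -0.5, 1.0]    # prediction with the condition
eps_u = [0.0, -0.1, 0.4]    # prediction with the null token
guided = cfg_combine(eps_c, eps_u, w=2.0)

# Training: roughly 10% of examples lose their condition
rng = random.Random(0)
dropped = sum(maybe_drop_condition("a cat", rng) is NULL_CONDITION for _ in range(10000))
```

The guided prediction extrapolates away from the unconditional one, so larger `w` trades diversity for closer adherence to the condition.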

Pros:

No separate classifier needed
Works with any condition type (text, image, etc.)
Often matches or exceeds the sample quality of classifier guidance in practice

Cons:

Needs two network evaluations per sampling step (conditional + unconditional), and one model must learn both
Guidance strength $w$ needs tuning

9.3 Multiple Condition Types

Modern diffusion models support multiple condition types:

Condition Type	Encoding Method	Integration
Text	CLIP, T5 transformer	Cross-attention
Image	CLIP vision, VAE encoder	Concatenation, attention
Depth/Edges	CNN encoder	ControlNet, adapter
Pose/Skeleton	Graph neural network	Spatial injection
Audio	VGGish, CLAP	Cross-attention

9.4 Controllability Methods

ControlNet:

Clone U-Net encoder layers
Train with zero convolution initialization
Lock original model, train control branches

IP-Adapter:

Add image encoder parallel to text encoder
Use decoupled cross-attention
Enables image prompt guidance

10. Theoretical Analysis

10.1 Connection to Variational Inference

Diffusion models optimize the variational lower bound (ELBO):

\log p_{θ} (x_{0}) \geq E_{q} [\log p_{θ} (x_{0} | x_{1})] - \sum_{t = 2}^{T} D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t})) - D_{KL} (q (x_{T} | x_{0}) ∥ p (x_{T}))

Interpretation:

Term 1: Reconstruction loss
Term 2: Consistency between forward and reverse processes (one KL term per timestep)
Term 3: Prior matching (ensure $x_{T}$ is close to Gaussian)
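
The link from Term 2 to the simplified loss in Section 3.2 can be made explicit. The sketch below assumes the reverse variance is fixed, $Σ_{θ} = σ_{t}^{2} I$, and that the mean is parameterized through the noise prediction; $C$ collects terms that do not depend on $θ$, and $\tilde{μ}_{t}$ denotes the posterior mean from Section 3.1:

```latex
\begin{aligned}
D_{KL}\big(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)
  &= \frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t,x_0)-\mu_\theta(x_t,t)\big\|^2 + C \\
\mu_\theta(x_t,t)
  &= \frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t,t)\Big) \\
\Rightarrow\quad D_{KL}
  &= \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\big\|\epsilon-\epsilon_\theta(x_t,t)\big\|^2 + C
\end{aligned}
```

Dropping the timestep-dependent weight in front of the squared error gives $L_{simple}$.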

10.2 Connection to Score Matching

Score matching objective:

J (θ) = \frac{1}{2} E_{p (x)} [∥ s_{θ} (x) - \nabla_{x} \log p (x) ∥^{2}]

For diffusion models, this becomes denoising score matching:

J (θ) = \frac{1}{2} \sum_{t = 1}^{T} E_{x_{0}, x_{t}} [∥ s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log q (x_{t} | x_{0}) ∥^{2}]

10.3 Neural Tangent Kernel (NTK) Analysis

In the infinite-width limit, diffusion model training can be analyzed via NTK:

Training dynamics: Governed by kernel regression
Generalization: Related to kernel eigenvalues
Mode coverage: Depends on data spectrum

10.4 Information Bottleneck Perspective

Forward diffusion as information bottleneck:

$I (x_{0}; x_{t})$ : mutual information between the data and its noised version at time $t$ , i.e., the information preserved at time $t$

Early timesteps: High mutual information (preserve details)
Late timesteps: Low mutual information (only semantic info)
Optimal schedule balances compression and preservation

11. Core Formula Cards

[!QUOTE] Reparameterization Noising
$x_{t} = \sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ$

[!QUOTE] DDPM Loss
$L_{simple} = E_{t, x_{0}, ϵ} [∥ ϵ - ϵ_{θ} (x_{t}, t) ∥^{2}]$

[!QUOTE] DDIM Sampling
$x_{t - 1} = \sqrt{{\bar{α}}_{t - 1}} (\frac{x_{t} - \sqrt{1 - {\bar{α}}_{t}} ϵ_{θ} (x_{t}, t)}{\sqrt{{\bar{α}}_{t}}}) + \sqrt{1 - {\bar{α}}_{t - 1} - σ_{t}^{2}} ϵ_{θ} (x_{t}, t) + σ_{t} z$

[!QUOTE] Classifier-Free Guidance
${\tilde{ϵ}}_{θ} (x_{t}, t, c) = (1 + w) \cdot ϵ_{θ} (x_{t}, t, c) - w \cdot ϵ_{θ} (x_{t}, t, \emptyset)$

12. Evaluation Metrics

12.1 Sample Quality

Metric	Description	Range
FID	Fréchet Inception Distance	Lower is better (0 is perfect)
IS	Inception Score	Higher is better
Precision/Recall	Quality vs. diversity trade-off	[0, 1]
KID	Kernel Inception Distance	Lower is better

FID Formula:

FID = ∥ μ_{r} - μ_{g} ∥^{2} + Tr (Σ_{r} + Σ_{g} - 2 (Σ_{r} Σ_{g})^{1 / 2})

where $μ_{r}, Σ_{r}$ are real data statistics and $μ_{g}, Σ_{g}$ are generated statistics.
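
For scalar features the matrix square root reduces to an ordinary square root, which makes the formula easy to try by hand. The sketch below is a one-dimensional illustration only (real FID uses high-dimensional Inception features and full covariance matrices); the function names and sample values are made up for the example.

```python
import math

def mean_and_var(values):
    """Sample mean and unbiased sample variance of a list of scalars."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, var

def frechet_distance_1d(mu_r, var_r, mu_g, var_g):
    """Frechet distance between N(mu_r, var_r) and N(mu_g, var_g) for scalar features."""
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)

real = [0.0, 1.0, 2.0, 3.0, 4.0]   # stand-in for features of real images
fake = [1.0, 2.0, 3.0, 4.0, 5.0]   # same spread, mean shifted by 1

fid_shifted = frechet_distance_1d(*mean_and_var(real), *mean_and_var(fake))
fid_same = frechet_distance_1d(*mean_and_var(real), *mean_and_var(real))
```

With identical variances the trace term vanishes, so the distance here is just the squared mean shift; identical statistics give zero.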

12.2 Likelihood Evaluation

Bits per dimension (bpd):

bpd = - \frac{\log_{2} p_{θ} (x)}{dim (x)}

Lower bpd indicates better likelihood.
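
Likelihoods are usually computed in nats, so a conversion is needed. A minimal sketch, where the function name and the example negative log-likelihood are hypothetical:

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

num_dims = 32 * 32 * 3                      # e.g., a 32x32 RGB image
nll_nats = 3.0 * num_dims * math.log(2.0)   # hypothetical NLL, chosen to equal 3 bpd
bpd = bits_per_dim(nll_nats, num_dims)
```

Dividing by the number of dimensions makes the value comparable across image resolutions.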

12.3 Diversity Metrics

Mode coverage: Percentage of data modes captured
LPIPS: Learned Perceptual Image Patch Similarity; the average pairwise LPIPS distance between samples serves as a diversity measure (higher means more diverse)
Unique samples: Ratio of unique generated samples

12.4 Human Evaluation

User studies: Preference ratings
Text-image alignment: Human ratings for text-to-image, often complemented by the automated CLIP score
Aesthetic quality: Aesthetic score predictors

13. Practical Implementation Tips

13.1 Network Architecture

U-Net Design:

Input
  ↓
Downsample Block 1 (128 channels)
  ↓
Downsample Block 2 (256 channels)
  ↓
Downsample Block 3 (512 channels)
  ↓
Middle Block with Attention (1024 channels)
  ↓
Upsample Block 3 (512 channels) + Skip Connection
  ↓
Upsample Block 2 (256 channels) + Skip Connection
  ↓
Upsample Block 1 (128 channels) + Skip Connection
  ↓
Output (3 channels)

Key Components:

ResNet blocks: Groups of 2-3 conv layers with skip connections
Attention: Self-attention on low-resolution feature maps (e.g., 16x16 or 32x32)
Time embedding: Sinusoidal position encoding → MLP
Conditioning: Cross-attention for text, AdaGN for class labels
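
The sinusoidal time embedding mentioned above can be sketched in plain Python. The function name, the `max_period` value, and the sine-then-cosine ordering are assumptions; implementations differ in these details, and the resulting vector is normally passed through an MLP before being injected into the network.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb0 = timestep_embedding(0, 8)      # all sines are 0, all cosines are 1
emb = timestep_embedding(250, 8)
```

Using many frequencies lets the network distinguish both nearby and distant timesteps from a bounded input.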

13.2 Training Best Practices

Hyperparameters:

Parameter	Recommended Value	Notes
Timesteps	1000	Standard, can use fewer for fast sampling
Batch size	256-512	Larger is better if memory allows
Learning rate	1e-4	Use cosine decay schedule
Optimizer	AdamW	β₁=0.9, β₂=0.999
EMA rate	0.9999	Exponential moving average
Gradient clipping	1.0	Prevents explosion
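
The EMA row corresponds to a one-line update applied after every optimizer step. A minimal sketch on plain lists (the function name is hypothetical; the usage example uses a decay of 0.5 only so the numbers are easy to follow):

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * param for each weight."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p

params = [1.0, -2.0]     # current model weights (held fixed for the demonstration)
ema = [0.0, 0.0]         # EMA copy, used for sampling instead of the raw weights
for _ in range(3):
    ema_update(ema, params, decay=0.5)
```

With a decay of 0.9999 the average spans roughly the last 10,000 updates, which smooths out step-to-step noise in the weights.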

Data Augmentation:

Random horizontal flip
Random crop and resize
No color jitter (changes data distribution)

13.3 Debugging Checklist

✓ Check noise schedule: Plot ${\bar{α}}_{t}$ vs $t$ , ensure smooth decay
✓ Monitor loss curves: Should decrease smoothly, no spikes
✓ Validate reparameterization: $x_{t}$ should match theoretical distribution
✓ Test sampling: Start with small model, verify basic functionality
✓ Check gradients: Norm should be reasonable (< 10)
✓ Visualize intermediates: Sample at different timesteps during training

13.4 Common Issues and Solutions

Problem	Cause	Solution
Blurry samples	Undertraining, high noise	Train longer, check schedule
Mode collapse	Low capacity, overfitting	Increase model size, add dropout
Training instability	High learning rate	Reduce LR, add gradient clipping
Slow sampling	Too many timesteps	Use DDIM, [[DPM-Solver]]
Poor conditioning	Weak guidance	Increase guidance strength $w$

14. Recent Advances (2023-2024)

14.1 Consistency Models

Key Idea: Learn a direct mapping from a noisy sample at any noise level to the data, so that generation needs only one step.

f_{θ} (x_{t}, t) \approx x_{0} \quad \forall t